-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clinton/ed 448/sdk html and text #820
Conversation
Unit test report (Pydantic 1.x)206 tests 206 ✅ 5s ⏱️ Results for commit 32348fb. ♻️ This comment has been updated with latest results. |
Unit test report ((Pydantic 2.x)206 tests 206 ✅ 6s ⏱️ Results for commit 32348fb. ♻️ This comment has been updated with latest results. |
file_type=label_row_dict.get("file_type", None), | ||
file_type=file_type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the longest time, this has just been None
lolol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that we should be able to get the file_type on first requesting the label_row and instead of having to be included in the blob when we initialise labels:
i.e: The above method is called in lr.initialise_labels().
Probably worth checking whether we get the file_type before and we just don't provide it in the label blob and then here it gets overridden by the empty field in the actual label blob. Pretty confident that this is the case, checked. Let me know if this comment is obtuse cause this is a bad change somewhat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes your comment is correct. I'm unsure as to what you want me to do though.
Are you trying to suggest that we don't overwrite it? And keep what we have in file_type before calling initialise_labels?
if range_html is not None: | ||
if self._ontology_object.shape != Shape.TEXT: | ||
raise LabelRowError( | ||
"Setting range_html of the object instance is only allowed for objects with the 'text' shape" | ||
) | ||
elif self._parent is not None and self._parent.file_type != "text/html": | ||
raise LabelRowError("Cannot add range_html to a non-html text file") | ||
elif len(self.range_list) > 0: | ||
raise LabelRowError("Cannot add range_html to an object instance that has a non-html range") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An important set of checks to ensure range_html is used properly on an ObjectInstance.
if len(range_html) > 0: | ||
self._range_html = range_html | ||
else: | ||
self._range_html = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows a user to REMOVE the range_html, by setting it to an empty array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of comments. Please take care that you aren't breaking historical backwards compatiblity around the file type changes.
annotation = obj.get_annotations()[0] | ||
ret[obj.object_hash]["range_html"] = [x.to_dict() for x in obj.range_html] | ||
ret[obj.object_hash]["range"] = [] | ||
ret[obj.object_hash]["createdBy"] = annotation.created_by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style:
Why so much C&P. Validation of non-type specific objects should be in a unified way in a unified place. i.e: Do createdBy stuff first and then use if statements to validate other fields
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha on C&P.
I don't understand the validation stuff.
elif self.data_type == DataType.PLAIN_TEXT or self.data_type == DataType.PDF: | ||
pass | ||
elif ( | ||
self.data_type == DataType.IMAGE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style:
Could use your introduced geometric_types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't work with exhaustive guard.
|
||
ret["labels"] = self._to_encord_labels(frame_level_data) | ||
|
||
if self._label_row_read_only_data.duration is not None: | ||
if self._label_row_read_only_data.duration is not None and self.data_type != DataType.PLAIN_TEXT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need this condition?
Can we not ensure that for data_type PLAIN_TEXT, that we don't provide a duration.
It's not entirely clear to me why we need the original branching condition given that it seems fine to propagate the None into the ret dictionary. Maybe this is just someone else's ugly serialisation code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea you're right about the original branching condition. Ideally, we only have to check the data_type before adding properties to the data_unit
, and not have to check whether the value from _label_row_read_only_data
is None
or not.
Maybe there was a historical reason for this? If no discernible reason, will clean up this function (albeit probably in a separate PR to reduce surface area)
@sergei-encord any thoughts on this?
file_type=label_row_dict.get("file_type", None), | ||
file_type=file_type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that we should be able to get the file_type on first requesting the label_row and instead of having to be included in the blob when we initialise labels:
i.e: The above method is called in lr.initialise_labels().
Probably worth checking whether we get the file_type before and we just don't provide it in the label blob and then here it gets overridden by the empty field in the actual label blob. Pretty confident that this is the case, checked. Let me know if this comment is obtuse cause this is a bad change somewhat.
if "createdAt" in d and d["createdAt"] is not None: | ||
created_at = parse_datetime(d["createdAt"]) | ||
else: | ||
created_at = datetime.now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😮
This doesn't make sense to me.
On creation yes, we should be providing a created_at timestamp. But I strongly disagree that on fail find, we should fallback to providing now as the created_at. I'd rather have None or the Unix Epoch time than now.
If you are using this branch as a mechanism somehow of updating created_at, if this is the first time creating labels, then I would reconsider refactoring such that it doesn't have this risky side effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that is very correct! I put this in at the start just to get it working. CreatedAt should always be there, no defaults required.
label_row.add_classification_instance(classification_instance) | ||
assert len(label_row.get_classification_instances()) == 1 | ||
classification_instance.set_for_frames(Range(start=2000, end=2499)) | ||
range_list = classification_instance.range_list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may just be personal:
This shouldn't be an ordered list imo, unless we are sorting it on requesting to ensure that we sort by the start index (If you are doing that then great!). The actual notion of classifications as you might view them in the label editor has the chronological ordering or here the lexicographic but if we are guaranteeing a stable order, it really needs to be derived from that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't quite understand this. ChatGPT suggests that:
They are questioning whether range_list is inherently or intentionally ordered. If it's being treated as an ordered list, they are asking whether you are explicitly sorting it by a meaningful criterion (e.g., the start index of ranges).
If that is the case, then yes, range_list
is being sorted by start index of ranges. See the add_range
function in encord/common/range_manager.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, yeah I was asking whether it has a stable ordering
offset (int): The offset of the content from the xpath | ||
""" | ||
|
||
node: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should call this xpath
? ATM node
is already in the class name HtmlNode, so feels like the field name can be more descriptive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the actual range_html
in the objects_index
, we call it "node" though.
Worried it might be confusing to have it called "node" in the exports and in our DB, but then have it called "xpath" here in this one place.
fc26087
to
92ec53a
Compare
6fac349
to
1084f6f
Compare
51454cc
to
b6f0245
Compare
encord/objects/coordinates.py
Outdated
range (Ranges): Ranges of chars for simple text files | ||
""" | ||
|
||
range_html: Optional[List[HtmlRange]] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be plural? ranges
and ranges_html
(and same for Audio above? could be fine to change since no one is using it yet I assume)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeap agreed, but in the label format, we just call it range
though. So using the same here for consistency.
@@ -31,7 +31,7 @@ | |||
|
|||
from encord.common.range_manager import RangeManager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be reasonable to expect label_row.add_classification_instance(cls_instance)
to work for text/html/audio without having to call cls_instance.set_for_frames
, given that we only support global classifications for these?
return f"{self.start.node}-{self.start.offset}-{self.end.node}-{self.end.offset}" | ||
|
||
@classmethod | ||
def from_dict(cls, d: dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can use this method to be able to construct TextCoordinates.range_html from a dict rather than having to import HtmlRange & HtmlNode?
No big deal though, it's fine as is
encord/objects/coordinates.py
Outdated
range_html: Optional[List[HtmlRange]] = None | ||
range: Optional[Ranges] = None | ||
|
||
def __post_init__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could separate TextCoordinates
and HtmlCoordinates
to not have to do the logic below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes makes sense!
encord/objects/coordinates.py
Outdated
""" | ||
|
||
range_html: Optional[List[HtmlRange]] = None | ||
range: Optional[Ranges] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be nice for the text ranges to be able to be a list of int
as well?
b6f0245
to
5c18d7a
Compare
dbcb6c3
to
3ee74fc
Compare
…on text files - Only allow classifications on frame=0 for non-geometric files
f14e19b
to
ea71694
Compare
Introduction and Explanation
Add object and classification support for html-text
Some things left to do:
ClassificationInstance.set_for_frames
on a non-geometric label row (i.e. audio or text), validate the rangesJIRA
Link ticket(s)
Documentation
There should be enough internal documentation for a product owner to write customer-facing documentation or a separate PR linked if writing the customer documentation directly. Link all that are relevant below.
Tests
Make a quick statement and post any relevant links of CI / test results. If the testing infrastructure isn’t yet in-place, note that instead.
Known issues
If there are any known issues with the solution, make a statement about what they are and why they are Ok to leave unsolved for now. Make tickets for the known issues linked to the original ticket linked above